Unsupervised Approach of Data Selection for Language Model Adaptation using Generalized Word Posterior Probability

Authors

  • Xinhui Hu
  • Shigeki Matsuda
  • Hideki Kashioka
Abstract

This paper reports an unsupervised approach to data selection for language model adaptation, used to improve spontaneous speech recognition in a speech-to-speech translation (S2ST) system. The approach is characterized by the following: (1) it obtains speech data from a real environment (sightseeing sites) in the travel domain; (2) it uses the recognition results of the collected speech for language model adaptation; (3) it applies generalized word posterior probability (GWPP) over the N-best recognition hypotheses as the basis of an utterance confidence measure for selecting adaptation utterances; (4) it adds a collected proper-noun lexicon to the baseline language model in the form of zeroton events, so that the system can recognize new proper nouns that were not previously contained in the recognition lexicon. In experiments on a Chinese speech test set collected in field experiments at five sightseeing areas in Japan, the adapted language model gave an average absolute reduction of 7.6% in character error rate (CER) relative to the baseline language model. This reduction is over 77% of the 9.8% reduction obtained by supervised adaptation. By manually correcting a small number of utterances that were not selected because of their low confidence, and adding them to the adaptation data, nearly 83% of the reduction achieved by the supervised method can be obtained. The proposed approach effectively improves utterance selection, especially for utterances containing proper nouns, and is expected to reduce the cost of manual transcription.
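The confidence-based selection step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the GWPP formula or how word scores are combined into an utterance score. The sketch assumes a simplified word posterior (a softmax over N-best hypothesis scores, matching words by identity only, with no time alignment or separate acoustic and language model weights), an arithmetic mean as the utterance confidence, and a fixed threshold. The names `Hypothesis`, `word_posteriors`, `utterance_confidence`, `select_utterances`, `alpha`, and `threshold` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list[str]   # recognized word sequence
    log_score: float   # combined log score from the recognizer (assumed)


def word_posteriors(nbest: list[Hypothesis], alpha: float = 1.0) -> list[float]:
    """Simplified posterior for each word of the best-scoring hypothesis.

    Assumption: a word's posterior is the normalized score mass of all
    N-best hypotheses that contain that word anywhere in the utterance.
    """
    best = max(nbest, key=lambda h: h.log_score)
    top_score = best.log_score
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(alpha * (h.log_score - top_score)) for h in nbest]
    total = sum(weights)
    return [
        sum(w for h, w in zip(nbest, weights) if word in h.words) / total
        for word in best.words
    ]


def utterance_confidence(nbest: list[Hypothesis], alpha: float = 1.0) -> float:
    """Assumed utterance confidence: mean of the word posteriors."""
    posts = word_posteriors(nbest, alpha)
    return sum(posts) / len(posts) if posts else 0.0


def select_utterances(
    nbest_by_utt: dict[str, list[Hypothesis]], threshold: float = 0.7
) -> dict[str, list[str]]:
    """Keep the top hypothesis of each utterance whose confidence passes the threshold."""
    selected = {}
    for utt_id, nbest in nbest_by_utt.items():
        if utterance_confidence(nbest) >= threshold:
            selected[utt_id] = max(nbest, key=lambda h: h.log_score).words
    return selected
```

In such a setup, the selected hypotheses would serve as adaptation text for the language model, while low-confidence utterances would be the candidates for manual correction.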

Related resources

Domain Adaptation of a Broadcast News Transcription System for the Portuguese Parliament

The main goal of this work is the adaptation of a broadcast news transcription system to a new domain, namely, the Portuguese Parliament plenary meetings. This paper describes the different domain adaptation steps that lowered our baseline absolute word error rate from 20.1% to 16.1%. These steps include the vocabulary selection, in order to include specific domain terms, language model adaptat...

Supervised and unsupervised Web-based language model domain adaptation

Domain language model adaptation consists in re-estimating probabilities of a baseline LM in order to better match the specifics of a given broad topic of interest. To do so, a common strategy is to retrieve adaptation texts from the Web based on a given domain-representative seed text. In this paper, we study how the selection of this seed text influences the adaptation process and the perform...

Language model adaptation using minimum discrimination information

In this paper, adaptation of language models using the minimum discrimination information criteria is presented. Language model probabilities are adapted based on unigram, bigram and trigram features using a modified version of the generalized iterative scaling algorithm. Furthermore, a language model compression algorithm, based on conditional relative entropy is discussed. It removes probabil...

Data point selection for cross-language adaptation of dependency parsers

We consider a very simple, yet effective, approach to cross language adaptation of dependency parsers. We first remove lexical items from the treebanks and map part-of-speech tags into a common tagset. We then train a language model on tag sequences in otherwise unlabeled target data and rank labeled source data by perplexity per word of tag sequences from less similar to most similar to the ta...

Language model parameter estimation using user transcriptions

In limited data domains, many effective language modeling techniques construct models with parameters to be estimated on an in-domain development set. However, in some domains, no such data exist beyond the unlabeled test corpus. In this work, we explore the iterative use of the recognition hypotheses for unsupervised parameter estimation. We also evaluate the effectiveness of supervised adapta...

Publication date: 2011